26 July, 2019

Proudly supported by the
people of Western Australia
through Channel 7's Telethon

Statistical Modelling in R

Notes

Order of included R functions:

  • lm()
  • summary()
  • ggplot() + stat_qq()
  • ggplot() + geom_smooth()
  • tidy(), augment(), glance()
  • summ(), plot_summs()
  • autoplot()
  • glm() (different families)
  • lmer()

What we cover

  • Linear Regression
  • Multiple Linear Regression
  • Generalized Linear Models
  • Briefly: Mixed effects model

Linear Regression

Linear Regression in R

In a linear regression, we aim to find a model:

  • that represents our data and
  • can give information about the association between our variables of interest.

The command in R for a linear model is

lm(y ~ x, data = data)

This is called the formula notation in R.

! Tip in the tidyverse

  • The formula-style code with the data argument at the end might not look very "tidy".
  • But we can still use the pipe %>% in the following way:

    data %>%
      lm(y ~ x, data = .)

  • With the "." we refer to the data coming through the pipe.
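As a quick sketch of the two spellings (using the built-in mtcars data in place of the workshop data), both fit the same model:

```r
library(magrittr)  # provides %>% (also attached by dplyr/tidyverse)

# Classic call: data passed via the data argument
fit_direct <- lm(mpg ~ wt, data = mtcars)

# Pipe-first call: the "." stands for the data coming through the pipe
fit_piped <- mtcars %>%
  lm(mpg ~ wt, data = .)

all.equal(coef(fit_direct), coef(fit_piped))  # the fits are identical
```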

Before we model

Data set summary

Let’s first have a look at the summary table of the example data set, by using the summary() command:

##       day1             day2               day3      
##  Min.   :-1.611   Min.   :  0.2396   Min.   : 0.00  
##  1st Qu.: 3.031   1st Qu.:  9.9763   1st Qu.:13.25  
##  Median : 5.619   Median : 15.9973   Median :19.37  
##  Mean   : 5.428   Mean   : 16.7252   Mean   :20.27  
##  3rd Qu.: 7.772   3rd Qu.: 20.6984   3rd Qu.:28.05  
##  Max.   :12.142   Max.   :143.8187   Max.   :39.03  
##                   NA's   :4          NA's   :16
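The table above comes from calling summary() on the data set itself, i.e. summary(tki_demo). The same works on any data frame; a self-contained sketch with the built-in mtcars data:

```r
# summary() on a data frame reports Min / 1st Qu. / Median / Mean /
# 3rd Qu. / Max (and an NA count, if any) for every numeric column
summary(mtcars[, c("mpg", "wt", "hp")])
```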

Visualisation of data distributions

Helpful plots before modelling

Before we start with the linear regression model, we need to get an idea of the underlying data and their distributions. We know that linear regression has the following assumptions:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

QQ-plot:

tki_demo %>%
  gather(Days, measurement, day1:day3, factor_key = TRUE) %>%
  ggplot(aes(sample = measurement, color = Days)) + stat_qq() + facet_wrap(~Days)

Boxplots to check for outliers

Plot the variables
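A sketch of such a boxplot, reshaping to long format first just as in the QQ-plot code (shown here with the built-in mtcars data; for the workshop data the gather(Days, measurement, day1:day3, ...) step would be identical):

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# One box per variable makes outliers (points beyond the whiskers) easy to spot
p <- mtcars %>%
  gather(variable, value, mpg, wt, hp, factor_key = TRUE) %>%
  ggplot(aes(x = variable, y = value, fill = variable)) +
  geom_boxplot()
p
```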

Linear Regression

The lm() function

In the plots we could already see that day 2 and day 3 seem to be associated; this becomes obvious when drawing a line through the points. Let's now fit a linear regression model in R.

lm(day2 ~ day3, data = tki_demo)

  • As said before, the first argument in the formula is y, our outcome variable or dependent variable. In this case it is day2.

  • The second argument is x, the independent variable. In our case: day3.

  • We also specify the data set that holds the variables we specified as x and y.

Linear Regression Results

Now we want to look at the results of the linear regression. So how do we get the p-value and \(\beta\)-coefficient for the association?

In R, we wrap the lm() call in the summary() function, like so:

summary(lm(y ~ x, data = data))

We can also store our model results in a variable:

model1 <- lm(y ~ x, data = data)

and then call summary on it:

summary(model1)
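Concretely, with the built-in mtcars data standing in for the workshop data:

```r
# Store the fit once, then query it as often as needed
model1 <- lm(mpg ~ wt, data = mtcars)
summary(model1)

coef(model1)     # the beta coefficients
confint(model1)  # their confidence intervals
```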

Example Results

## 
## Call:
## lm(formula = day1 ~ day3, data = tki_demo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1940 -2.2388  0.3604  2.2765  7.2805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.52452    0.88699   5.101 2.14e-06 ***
## day3         0.03819    0.04016   0.951    0.345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.225 on 82 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.0109, Adjusted R-squared:  -0.001158 
## F-statistic: 0.904 on 1 and 82 DF,  p-value: 0.3445

jtools and broom

Improving the accessibility of the lm() results

  • The output above contains a lot of relevant information, but it is not straightforward to access individual parameters such as p-values and betas.
  • The broom R package is in line with the "tidy" data handling in R and turns the linear model results into an easily accessible tibble format:
tidy(lm1)
## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)   4.52      0.887      5.10  0.00000214
## 2 day3          0.0382    0.0402     0.951 0.345
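tidy() is one of three broom verbs used in these notes; glance() and augment() work the same way. A self-contained sketch (mtcars in place of the workshop data):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)     # one row per coefficient: estimate, std.error, p.value, ...
glance(fit)   # one row per model: r.squared, AIC, p.value, ...
augment(fit)  # one row per observation: .fitted, .resid, .hat, ...
```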

Improving output style

  • The broom package helps with the accessibility of the output, but the style of the output is not very appealing for a publication or a report.
  • The jtools package helps with this and has other nice functionalities, such as forest plots for coefficients and confidence intervals:
export_summs(lm1)
                 Model 1
(Intercept)     4.52 ***
                (0.89)
day3            0.04
                (0.04)
N              84
R2              0.01
*** p < 0.001; ** p < 0.01; * p < 0.05.

Access more model info for diagnostics

model <- lm(day2 ~ day3, data = tki_demo)
head(augment(model))
.rownames  day2  day3 .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
        1 19.4  29.6     22.1    2.52  -2.7   0.0273   15.4 0.000451    -0.179
        3 22.8  39       27      4.05  -4.2   0.0703   15.4 0.00307     -0.285
        4  4.61  9.41    11.4    2.68  -6.84  0.0307   15.4 0.00327     -0.454
        5 19.9  14.7     14.2    1.99   5.63  0.017    15.4 0.0012       0.372
        7 20.3  24.6     19.4    1.92   0.823 0.0158   15.4 2.37e-05     0.0543
        9 13.7   9.47    11.5    2.67   2.23  0.0305   15.4 0.000346     0.148

Diagnostic Plots

autoplot(model)
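Note that autoplot() for lm objects is not part of base ggplot2; it is provided by the ggfortify package (loading it registers the method). A sketch with mtcars:

```r
library(ggfortify)  # supplies autoplot() methods for lm objects

model <- lm(mpg ~ wt, data = mtcars)
p <- autoplot(model)  # residuals vs fitted, normal QQ, scale-location, leverage
p
```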

Multiple Linear Regression

How to?

Multiple linear regression works with the same function in R :
lm(y ~ x + covar1 + covar2 + … + covarx , data = data)

export_summs(lm(day3 ~ day2 + male, data = tki_demo))
                 Model 1
(Intercept)    15.24 ***
               (1.44)
day2            0.15 *
               (0.06)
maleTRUE        6.47 ***
               (1.88)
N              80
R2              0.21
*** p < 0.001; ** p < 0.01; * p < 0.05.

Example 1: one model

With the demo data set:

export_summs(lm(day3 ~ day2 + male + smoker, data = tki_demo))
                 Model 1
(Intercept)    15.81 ***
               (1.46)
day2            0.16 **
               (0.06)
maleTRUE        7.11 ***
               (1.89)
smokerTRUE     -3.67
               (2.03)
N              80
R2              0.24
*** p < 0.001; ** p < 0.01; * p < 0.05.

Example 2: more models

With the demo data set:

lm1 <- lm(day3 ~ day2, data = tki_demo)
lm2 <- lm(day3 ~ day2 + male + smoker, data = tki_demo)
export_summs(lm1, lm2)
                 Model 1    Model 2
(Intercept)    17.25 ***  15.81 ***
               (1.41)     (1.46)
day2            0.16 **    0.16 **
               (0.06)     (0.06)
maleTRUE                   7.11 ***
                          (1.89)
smokerTRUE                -3.67
                          (2.03)
N              80          80
R2              0.09        0.24
*** p < 0.001; ** p < 0.01; * p < 0.05.

Forest plot to compare coefficients in the model

  • Often we want to visualise the coefficients in a model to see their impact on the outcome, or compare the coefficient of a specific variable across two models that differ only in the adjusted covariates.
  • The jtools package has a nice function to do this very easily, utilising ggplot2:
    plot_summs()

Example 1: one model

plot_summs(lm1)

Example 2: more models

plot_summs(lm1, lm2)

Example 3: compare specific coefficients

plot_summs(lm1, lm2, coefs = "day2")

Getting more info: Confidence Intervals

  • With jtools we can access more information from the model in one easy step.
  • Here, we access the confidence intervals and variance inflation factors (for multicollinearity testing), but leave out the p-values:
summ(lm2, scale = TRUE, vifs = TRUE, part.corr = TRUE, 
     confint = TRUE, pvals = FALSE)$coeftable
##                  Est.       2.5%      97.5%    t val.      VIF  partial.r
## (Intercept) 18.569528 16.2225902 20.9164657 15.758586       NA         NA
## day2         2.568780  0.7826129  4.3549470  2.864328 1.021056  0.3121443
## maleTRUE     7.107081  3.3480942 10.8660688  3.765636 1.041817  0.3965366
## smokerTRUE  -3.665084 -7.7050377  0.3748704 -1.806864 1.054610 -0.2029483
##                 part.r
## (Intercept)         NA
## day2         0.2862919
## maleTRUE     0.3763784
## smokerTRUE  -0.1805975

Generalized linear models

Principles

  • The generalized linear model works very similarly to lm(); we use the glm() function instead.
  • The lm() function always assumes normally distributed errors, whereas in glm() we have to specify the distribution family and its link function ourselves.
  • For a logistic regression, we assign: family = binomial(link = "logit")
  • For a Poisson regression, we assign: family = poisson(link = "log")
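A self-contained sketch of both families, using mtcars in place of the workshop data (am, the transmission type, as a binary outcome; carb, the carburetor count, as a count outcome):

```r
# Logistic regression: log-odds of a manual transmission (am = 1) by car weight
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
exp(coef(fit_logit))  # exponentiate to interpret as odds ratios

# Poisson regression: log of the expected carburetor count by horsepower
fit_pois <- glm(carb ~ hp, data = mtcars, family = poisson(link = "log"))
summary(fit_pois)
```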

Code and output: Logistic regression

export_summs(glm(smoker ~ male, data = tki_demo, family = binomial(link='logit')))
Model 1
(Intercept) -1.20 ***
(0.29)   
maleTRUE 0.68    
(0.46)   
N 100       
AIC 120.41    
BIC 125.62    
Pseudo R2 0.03    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Code and output: Poisson Regression

export_summs(glm(smoker ~ male,data = tki_demo, family = poisson(link='log')))
Model 1
(Intercept) -1.47 ***
(0.26)   
maleTRUE 0.48    
(0.38)   
N 100       
AIC 129.74    
BIC 134.95    
Pseudo R2 0.02    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Linear mixed effects model

The lme4 package

When we also want to account for random effects or clusters, such as an intervention whose effect is not itself of interest, or the study center in the case of a multi-center study, we use the lme4 package.

The formula works in the following way:

  • lmer(y ~ x + (1 + x | randomEffect), data = data), for a random intercept and a random slope of x within randomEffect
  • lmer(y ~ x + (1 + x | randomEffect) + (1 | otherVariable), data = data), additionally including a random intercept for otherVariable

Example

library(lme4)
lmer(day1 ~ day2 + (1 + day2 | intervention), data = tki_demo )
## Linear mixed model fit by REML ['lmerMod']
## Formula: day1 ~ day2 + (1 + day2 | intervention)
##    Data: tki_demo
## REML criterion at convergence: 496.8176
## Random effects:
##  Groups       Name        Std.Dev.  Corr 
##  intervention (Intercept) 1.278e-04      
##               day2        5.528e-06 -1.00
##  Residual                 3.147e+00      
## Number of obs: 96, groups:  intervention, 3
## Fixed Effects:
## (Intercept)         day2  
##     4.79533      0.03436  
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings
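Since tki_demo ships only with the workshop, the same kind of call can be tried on lme4's built-in sleepstudy data, with a random intercept and slope per Subject:

```r
library(lme4)

# Reaction time over days of sleep deprivation, per-subject random effects
fit <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
summary(fit)
fixef(fit)  # fixed-effect estimates only
```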

Extra: Simple tests

t-test, Wilcoxon test, chi-square test

  • t-test:
    • t.test(datx, daty)
  • Wilcoxon test:
    • wilcox.test(datx, daty)
  • chi-square test:
    • chisq.test(group1, group2)
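Minimal, self-contained sketches of these calls with simulated data (all names and values are illustrative only):

```r
set.seed(42)
datx <- rnorm(30, mean = 5)  # e.g. measurements in group x
daty <- rnorm(30, mean = 6)  # e.g. measurements in group y

t.test(datx, daty)       # compares means, assumes approximate normality
wilcox.test(datx, daty)  # rank-based, no normality assumption

# chisq.test() accepts a contingency table (or two factors)
group1 <- factor(rep(c("yes", "no"), times = c(25, 35)))
group2 <- factor(rep(c("a", "b"), times = 30))
chisq.test(table(group1, group2))
```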

Summary

Models

  • Linear regression and multiple linear regression both work by using lm()
  • Models for other distributions (e.g. logistic regression):
    • glm()
    • specify family and
    • link
  • Linear mixed effects model: lmer()

Tidy results

  • All models work with the broom and jtools packages
  • tidy results: tidy()
  • more results: e.g. augment(), glance()

Publication ready

  • With the jtools package
  • export_summs() for regression table
  • plot_summs() for forest plots

The end of: modelling in R